Abstract

One property of networks that has received 
comparatively little attention is hierarchy, 
i.e., the property of having vertices that clus- 
ter together in groups, which then join to 
form groups of groups, and so forth, up 
through all levels of organization in the net- 
work. Here, we give a precise definition of hi- 
erarchical structure, give a generic model for 
generating arbitrary hierarchical structure in 
a random graph, and describe a statistically 
principled way to learn the set of hierarchical 
features that most plausibly explain a particular
real-world network. By applying this approach to two
example networks, we demonstrate its advantages for
the interpretation of
network data, the annotation of graphs with 
edge, vertex and community properties, and 
the generation of generic null models for fur- 
ther hypothesis testing. 

1. Introduction 

Networks or graphs provide a useful mathematical rep- 
resentation of a broad variety of complex systems, from 
the World Wide Web and the Internet to social, bio- 
chemical, and ecological systems. The last decade has 
seen a surge of interest across the sciences in the study 
of networks, including both empirical studies of par- 
ticular networked systems and the development of new 
techniques and models for their analysis and
interpretation.

Within the mathematical sciences, researchers have
focused on the statistical characterization of network 
structure, and, at times, on producing descriptive
generative mechanisms of simple structures. This
approach, which emphasizes statistical summaries of
network structure such as path lengths, degree
distributions, and correlation coefficients, stands in
contrast with, for example, the work on networks in the
social and biological sciences, where the focus is
instead on the properties of individual vertices or
groups. More recently, researchers in both areas have
become more interested in the global organization of
networks [20].

One property of real-world networks that has received
comparatively little attention is that of hierarchy, i.e., 
the observation that networks often have a fractal-like 
structure in which vertices cluster together into groups 
that then join to form groups of groups, and so forth, 
from the lowest levels of organization up to the level of 
the entire network. In this paper, we offer a precise def- 
inition of the notion of hierarchy in networks and give 
a generic model for generating networks with arbitrary 
hierarchical structure. We then describe an approach 
for learning such models from real network data, based 
on maximum likelihood methods and Markov chain 
Monte Carlo sampling. In addition to inferring global 
structure from graph data, our method allows the re- 
searcher to annotate a graph with community struc- 
ture, edge strength, and vertex affiliation information. 

At its heart, our method works by sampling hierar- 
chical structures with probability proportional to the 
likelihood with which they produce the input graph. 
This allows us to contemplate the ensemble of ran- 
dom graphs that are statistically similar to the origi- 
nal graph, and, through it, to measure various average 
network properties in a manner reminiscent of Bayesian
model averaging. In particular, we can 

1. search for the maximum likelihood hierarchical 
model of a particular graph, which can then be 
used as a null model for further hypothesis test- 
ing, 

2. derive a consensus hierarchical structure from the 
ensemble of sampled models, where hierarchical 
features are weighted by their likelihood, and 

3. annotate an edge, or the absence of an edge, as 
"surprising" to the extent that it occurs with low 
probability in the ensemble. 

To our knowledge, this method is the only one that 
offers such information about a network. Moreover, 
this information can easily be represented in a human- 
readable format, providing a compact visualization 
of important organizational features of the network, 
which will be a useful tool for practitioners in gener- 
ating new hypotheses about the organization of net- 
works. 




Figure 1. A small network and one possible hierarchical or- 
ganization of its nodes, drawn as a dendrogram. 

2. Hierarchical Structures

The idea of hierarchical structure in networks is not 
new; sociologists, among others, have considered the 
idea since the 1970s. For instance, the method known 
as hierarchical clustering groups vertices in networks 
by aggregating them iteratively in a hierarchical
fashion [19]. However, it is not clear that the
hierarchical structures produced by these and other
popular methods are unbiased, as is also the case for
the hierarchical clustering algorithms of machine
learning [8]. That is,
it is not clear to what degree these structures reflect 
the true structure of the network, and to what degree 
they are artifacts of the algorithm itself. This confla- 
tion of intrinsic network properties with features of the 
algorithms used to infer them is unfortunate, and we 
specifically seek to address this problem here. 

A hierarchical network, as considered here, is one that 
divides naturally into groups and these groups them- 
selves divide into subgroups, and so on until we reach 
the level of individual vertices. Such structure is most 
often represented as a tree or dendrogram, as shown, 
for example, in Figure 1. We formalize this notion
precisely in the following way. Let G be a graph
with n vertices. A hierarchical organization of G is a
rooted binary tree whose leaves are the graph vertices
and whose internal (i.e., non-leaf) nodes indicate the
hierarchical relationships among the leaves. We denote
such an organization by
$\mathcal{D} = \{D_1, \ldots, D_{n-1}\}$, where each
$D_i$ is an internal node, and every node-pair $(u, v)$
is associated with a unique $D_i$, their lowest common
ancestor in the tree. In this way, $\mathcal{D}$
partitions the edges of G.

Figure 2. An example hierarchical model
$\mathcal{H}(\mathcal{D}, \theta)$, showing a hierarchy
among seven graph nodes and the Bernoulli trial
parameter $\theta_i$ (shown as a gray-scale value) for
each group of edges $D_i$.
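
The definition above can be made concrete with a small sketch. It assumes, purely for illustration, that a dendrogram is stored as nested 2-tuples whose leaves are vertex labels; the helper names `leaves` and `pairs_by_ancestor` are hypothetical and not taken from the paper. The sketch groups every vertex pair under its lowest common ancestor, which is the partition the text describes.

```python
def leaves(tree):
    """Return the set of graph vertices below a (sub)tree."""
    if not isinstance(tree, tuple):      # a leaf is a bare vertex label
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)


def pairs_by_ancestor(tree, out=None):
    """Map each internal node to the vertex pairs whose lowest common
    ancestor it is (one vertex in the left subtree, one in the right)."""
    if out is None:
        out = {}
    if isinstance(tree, tuple):
        left, right = tree
        out[tree] = {frozenset((u, v))
                     for u in leaves(left) for v in leaves(right)}
        pairs_by_ancestor(left, out)
        pairs_by_ancestor(right, out)
    return out


# A dendrogram on n = 5 vertices has n - 1 = 4 internal nodes.
dendrogram = (((0, 1), 2), (3, 4))
groups = pairs_by_ancestor(dendrogram)
```

Every one of the n(n-1)/2 vertex pairs lands in exactly one group, so the internal nodes partition the possible edges.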

3. A Random Graph Model of 
Hierarchical Organization 

We now give a simple model
$\mathcal{H}(\mathcal{D}, \theta)$ of the hierarchical
organization of a network. Our primary assumption is
that the edges of G exist independently, but with
probabilities that differ from pair to pair. One may
think of this model as a variation on the classical
Erdős-Rényi random graph, where now the probability
that an edge (u, v) exists is given by a parameter
$\theta_i$ associated with $D_i$, the lowest common
ancestor of u, v in $\mathcal{D}$. Figure 2 shows an
example model on seven graph vertices. In this manner,
a particular $\mathcal{H}(\mathcal{D}, \theta)$
represents an ensemble of inhomogeneous random graphs,
where the inhomogeneities are exactly specified by the
topological structure of the dendrogram $\mathcal{D}$
and the corresponding Bernoulli trial parameters
$\theta$. Certainly, one could write down a more
complicated model of
graph hierarchy. The model described here, however,
is a relatively generic one that is sufficiently powerful 
to enrich considerably our ability to learn from graph 
data. 
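
As an illustration of the generative side of the model, the following sketch draws one graph from a given dendrogram and parameter set. It assumes a nested 2-tuple representation of the dendrogram and a dictionary `theta` keyed by internal node; both choices, and the function names, are hypothetical conveniences rather than anything specified in the paper.

```python
import random


def leaves(tree):
    """Return the set of graph vertices below a (sub)tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def sample_graph(tree, theta, rng):
    """Draw one edge set from the model: each vertex pair is an
    independent Bernoulli trial whose success probability is the
    parameter stored at the pair's lowest common ancestor."""
    edges = set()
    if isinstance(tree, tuple):
        left, right = tree
        for u in leaves(left):
            for v in leaves(right):
                if rng.random() < theta[tree]:
                    edges.add(frozenset((u, v)))
        edges |= sample_graph(left, theta, rng)
        edges |= sample_graph(right, theta, rng)
    return edges


# Two tight groups that are never connected to each other.
dendrogram = ((0, 1), (2, 3))
theta = {dendrogram: 0.0, (0, 1): 1.0, (2, 3): 1.0}
edges = sample_graph(dendrogram, theta, random.Random(0))
```

Setting every parameter to the same value recovers the classical random graph that the text mentions as the starting point.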

Now we turn to the question of finding the
parametrizations of $\mathcal{H}(\mathcal{D}, \theta)$
that most accurately, or rather most plausibly,
represent the structure that we observe in our
real-world graph G. That is, we want to choose
$\mathcal{D}$ and $\theta$ such that a graph instance
drawn from the ensemble of random graphs represented by
$\mathcal{H}(\mathcal{D}, \theta)$ will be statistically
similar to G. If we already have a dendrogram
$\mathcal{D}$, then we may use the method of maximum
likelihood [8] to estimate the parameters $\theta$ that
achieve this goal. Let $E_i$ be the number of edges in G
that have lowest common ancestor i in $\mathcal{D}$, and
let $L_i$ ($R_i$) be the number of leaves in the left
(right) subtree rooted at i. Then, the maximum
likelihood estimator for the corresponding parameter is
$\theta_i = E_i / (L_i R_i)$, the fraction of potential
edges between the two subtrees of i that actually appear
in our data G. The posterior probability, or likelihood
of the model given the data, is then given by

$$\mathcal{L}_{\mathcal{H}}(\mathcal{D}, \theta) = \prod_{i=1}^{n-1} (\theta_i)^{E_i} (1 - \theta_i)^{L_i R_i - E_i} . \qquad (1)$$
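
The estimator and the likelihood in Eq. (1) can be computed directly from edge counts. The sketch below is a minimal illustration under stated assumptions: the dendrogram is a nested 2-tuple, edges are `frozenset` pairs, and the function name `log_likelihood` is hypothetical. It works with the logarithm of Eq. (1) and treats terms with a parameter of exactly 0 or 1 as contributing zero, following the usual convention that 0 log 0 = 0.

```python
import math


def leaves(tree):
    """Return the set of graph vertices below a (sub)tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def log_likelihood(tree, edges):
    """Return (log L, theta), where theta maps each internal node to
    its maximum likelihood parameter E_i / (L_i * R_i)."""
    theta, logl = {}, 0.0
    stack = [tree]
    while stack:
        node = stack.pop()
        if not isinstance(node, tuple):
            continue                      # leaves carry no parameter
        left, right = node
        L, R = leaves(left), leaves(right)
        E = sum(1 for u in L for v in R if frozenset((u, v)) in edges)
        pairs = len(L) * len(R)
        t = E / pairs
        theta[node] = t
        if 0.0 < t < 1.0:                 # t in {0, 1} contributes log 1 = 0
            logl += E * math.log(t) + (pairs - E) * math.log(1.0 - t)
        stack.extend((left, right))
    return logl, theta


# A path 0-1-2-3 fitted with the dendrogram ((0, 1), (2, 3)).
graph = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3)]}
logl, theta = log_likelihood(((0, 1), (2, 3)), graph)
```

In this example only the root contributes: one of the four possible edges between the two halves is present, so its parameter is 1/4.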

While it is easy to find values of $\theta_i$ by maximum
likelihood for each dendrogram, it is not easy to
maximize the resulting likelihood function analytically
over the space of all dendrograms. Instead, therefore,
we employ a Markov chain Monte Carlo (MCMC) method to
estimate the posterior distribution by sampling from the
set of dendrograms with probability proportional to
their likelihood. We note that the number of possible
dendrograms with n leaves is super-exponential, growing
like $(2n-3)!! \approx \sqrt{2}\,(2n)^{n-1} e^{-n}$,
where $!!$ denotes the double factorial. We find,
however, that in practice our MCMC process mixes
relatively quickly for networks of up to a few thousand
vertices. Finally, to keep our notation concise, we will
use $\mathcal{L}_\mu$ to denote the likelihood of a
particular dendrogram $\mu$, when calculated as above.
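
To get a feel for how quickly the number of dendrograms grows, the following sketch evaluates the double factorial exactly and compares it with the approximation quoted above. The function names are hypothetical, and n = 34 is chosen only because it matches the size of the karate club network discussed later.

```python
import math


def num_dendrograms(n):
    """Exact (2n - 3)!!: the product 3 * 5 * ... * (2n - 3)."""
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count


def approx_num_dendrograms(n):
    """The approximation sqrt(2) * (2n)^(n-1) * e^(-n)."""
    return math.sqrt(2) * (2 * n) ** (n - 1) * math.exp(-n)


ratio = num_dendrograms(34) / approx_num_dendrograms(34)
print(f"exact / approximate for n = 34: {ratio:.3f}")
```

The ratio stays close to 1, while the count itself already far exceeds what exhaustive enumeration could handle, which is why sampling is used instead.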

4. Markov Chain Monte Carlo sampling 

Our Monte Carlo method uses the standard 
Metropolis-Hastings sampling scheme; we now 
briefly discuss the ergodicity and detailed balance is- 
sues for our particular application. 

Let $\nu$ denote the current state of the Markov chain,
which is a dendrogram $\mathcal{D}$. Each internal node
i of the dendrogram is associated with three subtrees a,
b, and c, where two are its children and one is its
sibling (see Figure 3). As the figure shows, these
subtrees can be in one of three hierarchical
configurations. To select a candidate state transition
$\nu \to \mu$ for our Markov chain, we first choose an
internal node uniformly at random and then choose one of
its two alternate configurations uniformly at random. It
is then straightforward to show that the ergodicity
requirement is satisfied.

Figure 3. Each internal dendrogram node i (circle) has
three associated subtrees a, b, and c (triangles), which
together can be in any of three configurations (up to a
permutation of the left-right order of subtrees).
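
The candidate move can be sketched as follows. This is an illustration under several assumptions that the text does not spell out: dendrograms are nested 2-tuples, the chosen internal node is restricted to non-root nodes (the root has no sibling), the tree has at least three leaves, and the names `internal_paths`, `rearrange`, and `propose` are hypothetical.

```python
import random


def internal_paths(tree, path=()):
    """Yield the path (child indices 0/1 from the root) to every
    non-root internal node."""
    if isinstance(tree, tuple):
        for side in (0, 1):
            child = tree[side]
            if isinstance(child, tuple):
                yield path + (side,)
                yield from internal_paths(child, path + (side,))


def rearrange(parent, side, choice):
    """Given parent = ((a, b), c) with node i = (a, b) on `side`,
    return one of the two alternate configurations:
    choice 0 gives ((a, c), b), choice 1 gives ((b, c), a)."""
    a, b = parent[side]
    c = parent[1 - side]
    if choice == 0:
        return ((a, c), b)
    return ((b, c), a)


def propose(tree, rng):
    """Pick a non-root internal node and one of its two alternate
    configurations uniformly at random; return the new dendrogram."""
    path = rng.choice(list(internal_paths(tree)))
    choice = rng.randrange(2)

    def rebuild(node, remaining):
        if len(remaining) == 1:           # `node` is the parent of i
            return rearrange(node, remaining[0], choice)
        side = remaining[0]
        child = rebuild(node[side], remaining[1:])
        return (child, node[1 - side]) if side == 0 else (node[0], child)

    return rebuild(tree, path)


candidate = propose((((0, 1), 2), (3, 4)), random.Random(0))
```

Because the left-right order of subtrees carries no meaning, the sketch does not try to preserve it when rebuilding the tree.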

Detailed balance is ensured by making the standard
Metropolis choice of acceptance probability for our
candidate transition: we always accept a transition that
yields an increase in likelihood or no change, i.e., for
which $\mathcal{L}_\mu \geq \mathcal{L}_\nu$; otherwise,
we accept a transition that decreases the likelihood
with probability equal to the ratio of the respective
state likelihoods
$\mathcal{L}_\mu / \mathcal{L}_\nu = e^{\log \mathcal{L}_\mu - \log \mathcal{L}_\nu}$.
This Markov chain then generates dendrograms $\mu$ at
equilibrium with probabilities proportional to
$\mathcal{L}_\mu$.
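
The acceptance rule is short enough to state in code. The sketch below works with log-likelihoods, as the exponential form of the ratio suggests; the function name `accept` and its argument names are hypothetical.

```python
import math
import random


def accept(logl_current, logl_candidate, rng):
    """Metropolis rule: always accept a move that does not lower the
    likelihood; otherwise accept with probability equal to the
    likelihood ratio, computed from the log-likelihood difference."""
    if logl_candidate >= logl_current:
        return True
    return rng.random() < math.exp(logl_candidate - logl_current)


rng = random.Random(1)
trials = 20000
# A candidate whose likelihood is one quarter of the current one.
hits = sum(accept(0.0, math.log(0.25), rng) for _ in range(trials))
print(f"empirical acceptance rate: {hits / trials:.3f}")
```

Working in log space avoids underflow, since the likelihoods themselves are products of many small factors.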

5. Mixing Time and Point Estimates 

With the formal framework of our method established, 
we now demonstrate its application to two small, 
canonical networks: Zachary's karate club [22], a social
network of n = 34 nodes and m = 78 edges representing
friendship ties among students at a university karate
club; and the year 2000 schedule of NCAA college
(American) football games, a network of n = 115 nodes
and m = 613 edges in which nodes represent college
football teams and edges connect teams that played each
other during the 2000 season. Both of these networks
have found use as standard tests of clustering
algorithms for complex networks [15] and serve as a
useful comparative basis for our methodology.

Figure 4 shows the convergence of the MCMC sampling
algorithm to the equilibrium region of model space for
both networks, where we measure the number of steps
normalized by $n^2$. We see that the Markov chain mixes
quickly for both networks, and in practice we find that
the method works well on networks with up to a few
thousand vertices. Improving the mixing time, so as to
apply our method to larger graphs, may be possible by
considering state transitions that more dramatically
alter the structure of the dendrogram, but we do not
consider them here. Additionally, we find that the
equilibrium region contains many roughly competitive
local maxima, suggesting that any particular maximum
likelihood point estimate of the posterior probability
is likely to be an overfit of the data. However,
formulating an appropriate penalty function for a more
Bayesian approach to the calculation of the posterior
probability appears tricky, given that it is not clear
how to characterize such an overfit. Instead, we here
compute average features of the dendrogram over the
equilibrium distribution of models to infer the most
general hierarchical organization of the network. This
process is described in the following section.

Structural Inference of Hierarchies in Networks

[Figure 4 plot: two panels, "karate, n=34" and "NCAA 2000, n=115"; horizontal axis: time / $n^2$; vertical axis: log-likelihood.]

Figure 4. Log-likelihood as a function of the number of
MCMC steps, normalized by $n^2$, showing rapid
convergence to equilibrium.

To give the reader an idea of the kind of dendrograms 
our method produces, we show instances that correspond
to local maxima found during equilibrium sampling for
each of our example networks in Figures 5 (top) and 6
(top). For both networks, we can validate
the algorithm's output using known metadata for the 
nodes. During Zachary's study of the karate network, 
for instance, the club split into two groups, centered 
on the club's instructor and owner (nodes 1 and 34 re- 
spectively), while in the college football schedule teams 
are divided into "conferences" of 8-12 teams each, with 
a majority of games being played within conferences. 
Both networks have previously been shown to exhibit
strong community structure, and our dendrograms reflect
this finding, almost always placing leaves with a common
label in the same subtree. In the case
of the karate club, in particular, the dendrogram bipar- 
titions the network perfectly according to the known 
groups. Many other methods for clustering nodes in 
graphs have difficulty correctly classifying vertices that 
lie at the boundary of the clusters; in contrast, our 
method has no trouble correctly placing these periph- 
eral nodes. 

6. Consensus Hierarchies 

Turning now to the dendrogram sampling itself, we 
consider three specific structural features, which we 
average over the set of models explored by the MCMC 
at equilibrium. First, we consider the hierarchical re- 
lationships themselves, adapting for the purpose the 
technique of majority consensus, which is widely used 
in the reconstruction of phylogenetic trees. Briefly,
this method takes a collection of trees
$\{T_1, T_2, \ldots, T_k\}$ and derives a majority
consensus tree $T_{\mathrm{maj}}$ containing only those
hierarchical features that have majority weight, where
each tree in the collection is assigned a weight. For
our purposes, we take the weight of a dendrogram
$\mathcal{D}$ simply to be its likelihood
$\mathcal{L}_{\mathcal{D}}$, which produces an averaging
scheme similar to Bayesian model averaging [8]. Once we
have tabulated the majority-weight hierarchical
features, we use a reconstruction technique to produce
the consensus dendrogram. Note that $T_{\mathrm{maj}}$
is always a tree, but is not necessarily strictly
binary.
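
The tabulation step can be sketched as follows. The sketch assumes that a "hierarchical feature" is the set of leaves below an internal node, that dendrograms are nested 2-tuples, and that weights are likelihoods normalized over the sample; these are illustrative choices, and the names `clusters` and `majority_clusters` are hypothetical. It stops at listing the majority-weight features and does not perform the final tree reconstruction.

```python
import math
from collections import defaultdict


def clusters(tree):
    """Return the leaf sets (one frozenset per internal node) of a
    nested-tuple dendrogram."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = walk(node[0]) | walk(node[1])
        found.add(below)
        return below

    walk(tree)
    return found


def majority_clusters(trees, log_likelihoods):
    """Weight each tree by its likelihood (normalized to sum to 1)
    and keep the clusters whose total weight exceeds one half."""
    top = max(log_likelihoods)            # subtract the max for stability
    weights = [math.exp(l - top) for l in log_likelihoods]
    total = sum(weights)
    support = defaultdict(float)
    for tree, w in zip(trees, weights):
        for c in clusters(tree):
            support[c] += w / total
    return {c: s for c, s in support.items() if s > 0.5}


sampled = [(((0, 1), 2), 3), (((0, 1), 3), 2), (((0, 2), 1), 3)]
consensus = majority_clusters(sampled, [-5.0, -5.0, -5.0])
```

Any two clusters that each hold more than half the weight must co-occur in at least one sampled tree, so they are nested or disjoint, which is why the majority features always fit together into a single (possibly non-binary) tree.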

The results of applying this process to our example 
networks are shown in Figures 5 (bottom) and 6 (bottom).
For the karate club network, we observe that the
bipartition of the two clusters remains the dominant
hierarchical feature after sampling a large number of
models at equilibrium, and that much of the particular
structure low in the dendrogram shown in Figure 5 (top)
is eliminated as distracting. Similarly, we
observe some coarsening of the hierarchical structure 
in the NCAA network, as the relationships between 
individual teams are removed in favor of conference 
clusterings. 

7. Edge and Node Annotations 

We can also assign majority-weight properties to nodes
and edges. We first describe the former, where we 
assign a group affiliation to each node. 

Given a vertex, we may ask with what likelihood it is
placed in a subtree composed primarily of other members
of its group (with group membership determined by
metadata as in the examples considered here). In a
dendrogram $\mathcal{D}$, we say that a subtree rooted
at some node i encompasses a group g if both the
majority of the descendants of i are members of group g
and the majority of members of group g are descendants
of i. We then assign every leaf below i the label of g.
We note that there may be some leaves that belong to no
group, i.e., none of their ancestors simultaneously
satisfy both the above requirements, and vertices of
this kind get a special no-group label. Again, by
weighting the group-affiliation vote of each dendrogram
by its likelihood, we may measure the average
probability that a node belongs to its native group's
subtree.

Figure 5. Zachary's karate club network: (a) an exemplar
maximum likelihood dendrogram with
$\log \mathcal{L} = -73.32$, where parameters $\theta_i$
are shown as gray-scale values and leaf shapes denote
conference affiliation; and (b) the consensus hierarchy
sampled at equilibrium. Leaf shapes are common between
(a) and (b), but position varies.
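
The "encompasses" test for a single dendrogram can be sketched as below. The sketch reads "majority" as strictly more than half, which is an assumption, and it uses a nested 2-tuple dendrogram with a hypothetical `group_of` dictionary mapping each vertex to its metadata label. The function `in_native_subtree` reports whether some ancestor of a vertex encompasses that vertex's own group.

```python
def leaves(tree):
    """Return the set of graph vertices below a (sub)tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def encompasses(subtree_leaves, group_of, g):
    """True if most leaves of the subtree belong to group g and most
    members of group g lie in the subtree (strict majorities)."""
    members = {v for v, label in group_of.items() if label == g}
    inside = subtree_leaves & members
    return (2 * len(inside) > len(subtree_leaves)
            and 2 * len(inside) > len(members))


def in_native_subtree(tree, group_of, v):
    """Walk from the root down to leaf v and report whether any
    ancestor of v encompasses v's own group."""
    node = tree
    while isinstance(node, tuple):
        if encompasses(leaves(node), group_of, group_of[v]):
            return True
        node = node[0] if v in leaves(node[0]) else node[1]
    return False


tree = (((0, 1), (2, 4)), (3, 5))
group_of = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C", 5: "C"}
placed = {v: in_native_subtree(tree, group_of, v) for v in group_of}
```

Averaging this indicator over sampled dendrograms, weighted by likelihood, gives the per-node quantity the paragraph above describes; vertices for which it is false in a given dendrogram correspond to the no-group case.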

Second, we can measure the average probability that 
an edge exists, by taking the likelihood-weighted
average over the sequence of parameters $\theta_i$
associated with that edge at equilibrium.
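
A minimal sketch of this average follows. It assumes each sample is a triple of a nested-tuple dendrogram, a dictionary of parameters keyed by internal node, and a log-likelihood; this layout and the function names are hypothetical.

```python
import math


def leaves(tree):
    """Return the set of graph vertices below a (sub)tree."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def edge_theta(tree, theta, u, v):
    """Return the parameter at the lowest common ancestor of u and v."""
    node = tree
    while True:
        left, right = node
        u_left, v_left = u in leaves(left), v in leaves(left)
        if u_left != v_left:              # the pair splits here
            return theta[node]
        node = left if u_left else right


def average_edge_probability(samples, u, v):
    """Likelihood-weighted mean of the edge parameter over samples,
    each given as (tree, theta, log_likelihood)."""
    top = max(logl for _, _, logl in samples)
    weights = [math.exp(logl - top) for _, _, logl in samples]
    values = [edge_theta(tree, theta, u, v) for tree, theta, _ in samples]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


t1 = ((0, 1), 2)
t2 = ((0, 2), 1)
samples = [
    (t1, {t1: 0.2, (0, 1): 1.0}, -1.0),
    (t2, {t2: 0.5, (0, 2): 0.0}, -1.0),
]
p01 = average_edge_probability(samples, 0, 1)
```

An observed edge with a low average probability, or an absent edge with a high one, is what the introduction calls "surprising".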

Estimating these vertex and edge characteristics al- 
lows us to annotate the network, highlighting the most 
plausible features, or the most surprising. Figures 7
and 8 show such annotations for the two example
networks, where edge thickness is proportional to average
probability, and nodes are shaded proportional to the 
sampled weight of their native group affiliation (light- 
est corresponds to highest probability). 

For the karate network, the dendrogram sampling both
confirms our previous understanding of the network as
being composed of two loosely connected groups, and adds
new information. For instance, node 17 and the pair
{25, 26} are found to be more loosely bound to their
respective groups than other vertices, a feature that is
supported by the average hierarchical structure shown in
Figure 5 (bottom). This looseness apparently arises
because none of these vertices has a direct connection
to the central players 1 and 34, and they are thus
connected only secondarily to the cores of their
clusters. Also, our method correctly places vertex 3 in
the cluster surrounding 1, a placement with which many
other methods have difficulty.

Figure 6. The NCAA Schedule 2000 network: (a) an
exemplar maximum likelihood dendrogram with
$\log \mathcal{L} = -884.2$, where parameters $\theta_i$
are shown as gray-scale values and leaf shapes denote
conference affiliation; and (b) the consensus hierarchy
sampled at equilibrium. Leaf shapes are common between
(a) and (b), but position varies.

Figure 7. An annotated version of the karate club
network. Line thickness for edges is proportional to
their average probability of existing, sampled at
equilibrium. Vertices have shapes corresponding to their
known group associations, and are shaded according to
the sampled weight of their being correctly grouped (see
text).

The NCAA network shows similarly suggestive results, 
with the majority of heavily weighted edges falling 
within conferences. Most nodes are strongly placed 
within their native groups, with a few notable excep- 
tions, such as the independent colleges, vertices 82, 
80, 42, 90, and 36, which belong to none of the ma- 
jor conferences. These teams are typically placed by 
our method in the conference in which they played the 
most games. Although these annotations illustrate in- 
teresting aspects of the NCAA network's structure, we 
leave a thorough analysis of the data for future work. 

8. Discussion and conclusions 

As mentioned in the introduction, we are not the first 
to study hierarchy in networks. In addition to persis- 
tent interest in the sociology community, a number of 
authors in physics have recently discussed aspects of 
hierarchical structure, although generally via indirect
or heuristic means. A closely related, and much studied,
concept is that of community structure in networks
[15, 13, 5]. In community structure calculations one
attempts to find a natural partition of the network that
yields densely connected subgraphs or communities. Many
algorithms for detecting community structure iteratively
divide (or agglomerate) groups of vertices to produce a
reasonable partition; the sequence of such divisions (or
agglomerations) can then be represented as a dendrogram
that is often considered to encode some structure of the
graph itself. (Notably, a very recent exception among
these community detection heuristics is a method based
on maximum likelihood and survey propagation [9].)

Unfortunately, while these algorithms often produce 
reasonable looking dendrograms, they have the same 
fundamental problems as traditional hierarchical
clustering algorithms for numeric data [8]. That is, it is
not clear to what extent the derived hierarchical struc- 
tures depend on the details of the algorithms used to 
extract them. It is also unclear how sensitive they are 
to small perturbations in the graph, such as the ad- 
dition or removal of a few edges. Further, these algo- 
rithms typically produce only a single dendrogram and 
provide no estimate of the form or number of plausible 
alternative structures. 

In contrast to this previous work, our method directly
addresses these problems by explicitly fitting a
hierarchical structure to the topology of the graph. We
precisely define a general notion of hierarchical
structure that is algorithm-independent, and we use this
definition to develop a random graph model of a
hierarchically structured network that we use in a
statistical inference context. By sampling via MCMC the
set of dendrogram models that are most likely to
generate the observed data, we estimate the posterior
distribution over models and, through a scheme akin to
Bayesian model averaging, infer a set of features that
represent the general organization of the network. This
approach provides a mathematically principled way to
learn about hierarchical organization in real-world
graphs. Compared to the previous methods, our approach
yields considerable advantages, although at the expense
of being more computationally intensive. For smaller
graphs, however, for which the calculations described
here are tractable, we believe that the insight provided
by our methods makes the extra computational effort very
worthwhile. In future work, we will explore the
extension of our methods to larger networks and
characterize the errors the technique can produce.

Figure 8. An annotated version of the college football
schedule network. Annotations are as in Figure 7. Note
that node shapes here differ from those in Figure 6, but
numerical indices remain the same.

In closing, we note that the method of dendrogram 
sampling is quite general and could, in principle, be 
used to annotate any number of other graph features 
with information gained by model averaging. We be- 
lieve that the ability to show which network features 
are surprising under our model and which are com- 
mon is genuinely novel and may lead to a better un- 
derstanding of the inherently stochastic processes that 
generate much of the network data currently being an- 
alyzed by the research community. 